
Empirical Risk Minimization

Principle

  • Pick any $\hat{g}_n$ that agrees with the training data: $\hat{g}_n(X_i) = Y_i,\ i = 1, \cdots, n$

  • Question: what is $L(\hat{g}_n)$? Given $\epsilon > 0$, how small is $P(L(\hat{g}_n) > \epsilon)$?

    • Here, $L(g) = P(Y \neq g(X))$

    • Let $G_B$ = "bad" classification rules $= \{g \in G : L(g) > \epsilon\}$

      \begin{gather*}
      P(L(\hat{g}_n) > \epsilon) = P(\hat{g}_n \in G_B)
      \end{gather*}

      Take $g \in G_B$: if $\hat{g}_n = g$, then $g(X_i) = Y_i$ for $i = 1, 2, \cdots, n$.

      Since $L(g) > \epsilon$ and the pairs $(X_i, Y_i)$ are i.i.d.,

      \begin{align*}
      P(g(X_1) \neq Y_1) &= L(g) > \epsilon \\
      P(g(X_1) = Y_1) &\leq 1 - \epsilon \\
      P(g(X_i) = Y_i,\ i = 1, \cdots, n) &= P(g(X_1) = Y_1) \times P(g(X_2) = Y_2) \times \cdots \times P(g(X_n) = Y_n) \\
      &= \prod_{i=1}^{n} P(g(X_i) = Y_i) \\
      &\leq (1 - \epsilon)^n \leq e^{-\epsilon n}
      \end{align*}
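The final step above uses the elementary inequality $1 - \epsilon \leq e^{-\epsilon}$, so $(1 - \epsilon)^n \leq e^{-\epsilon n}$. A quick numerical sanity check (not part of the notes, just an illustration):

```python
import math

# Check (1 - eps)^n <= exp(-eps * n) for a few values of eps and n.
# This holds for any eps in (0, 1) because 1 - eps <= exp(-eps).
for eps in (0.01, 0.1, 0.5):
    for n in (1, 10, 100):
        assert (1 - eps) ** n <= math.exp(-eps * n)
```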

      Reminder (union bound): $P(A \cup B) \leq P(A) + P(B)$.
      $\therefore$ if $G_B = \{g_1, \cdots, g_k\}$, then

      \begin{align*}
      P(\hat{g}_n \in G_B) &= P(\hat{g}_n = g_1 \text{ or } \hat{g}_n = g_2 \text{ or } \cdots \text{ or } \hat{g}_n = g_k) \\
      &\leq P(\hat{g}_n = g_1) + P(\hat{g}_n = g_2) + \cdots + P(\hat{g}_n = g_k) \\
      &\leq k e^{-\epsilon n} \\
      &\leq |G| e^{-\epsilon n}
      \end{align*}
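The union bound holds even for overlapping (dependent) events, which is exactly why it applies to the $\{\hat{g}_n = g_j\}$ events here. A small Monte Carlo illustration with two deliberately overlapping events (the events and values below are made up for illustration):

```python
import random

random.seed(0)
# Events on a uniform draw U in [0, 1]: A = {U < 0.3}, B = {0.2 < U < 0.5}.
# A and B overlap, yet P(A or B) <= P(A) + P(B) still holds.
trials = 100_000
a = b = union = 0
for _ in range(trials):
    u = random.random()
    in_a, in_b = u < 0.3, 0.2 < u < 0.5
    a += in_a
    b += in_b
    union += in_a or in_b
assert union / trials <= a / trials + b / trials
```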
    • If one requires that

      \begin{gather*}
      P(L(\hat{g}_n) \leq \epsilon) \geq 1 - \delta,
      \end{gather*}

      then by the bound above it suffices to have $|G| e^{-\epsilon n} \leq \delta$, that is,

      \begin{gather*}
      n \geq \left\lceil \frac{\log(|G| / \delta)}{\epsilon} \right\rceil
      \end{gather*}
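The sample-size bound is easy to evaluate numerically. A minimal sketch (the function name and the example values $|G| = 1000$, $\epsilon = 0.1$, $\delta = 0.05$ are my own, chosen for illustration):

```python
import math

def erm_sample_size(card_G, eps, delta):
    """Smallest n with card_G * exp(-eps * n) <= delta,
    i.e. n >= ceil(log(card_G / delta) / eps)."""
    return math.ceil(math.log(card_G / delta) / eps)

n = erm_sample_size(card_G=1000, eps=0.1, delta=0.05)  # → 100
# Check that this n indeed drives the failure probability below delta:
assert 1000 * math.exp(-0.1 * n) <= 0.05
```

Note the logarithmic dependence on $|G|$: squaring the size of the class only adds a constant factor to the required sample size.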

Summary

  • ERM picks a classifier that makes the smallest number of mistakes on observed data.
  • Define: $L_n(g) = \dfrac{\#\{1 \leq j \leq n : g(X_j) \neq Y_j\}}{n}$
  • Remark: by the Law of Large Numbers, $L_n(g) \rightarrow L(g)$, since $\mathbb{E}\, L_n(g) = L(g)$.
  • The ERM rule picks the $\hat{g}_n$ that minimizes $L_n(g)$ over $g \in G$.
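For a finite class $G$, the ERM rule above can be sketched directly: compute $L_n(g)$ for each $g \in G$ and take the minimizer. The toy class of threshold classifiers and the data below are hypothetical, purely for illustration:

```python
def empirical_risk(g, xs, ys):
    """L_n(g): fraction of training points that rule g misclassifies."""
    return sum(g(x) != y for x, y in zip(xs, ys)) / len(xs)

def erm(G, xs, ys):
    """Return a rule in the finite class G with smallest empirical risk L_n."""
    return min(G, key=lambda g: empirical_risk(g, xs, ys))

# Toy example: threshold classifiers g_t(x) = 1{x > t} on the real line.
G = [lambda x, t=t: int(x > t) for t in (0.0, 0.25, 0.5, 0.75)]
xs = [0.1, 0.2, 0.6, 0.9]
ys = [0, 0, 1, 1]
g_hat = erm(G, xs, ys)
# g_hat separates this data perfectly, so empirical_risk(g_hat, xs, ys) == 0.0
```

This brute-force minimization is exactly what the finite-class analysis above assumes; for infinite classes one needs a different argument (e.g. VC theory) and a different search procedure.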